Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Beijing Normal University, Australian National University, Beijing 101 Education Group
✉️ Correspondence to fangweizhong@bnu.edu.cn

Abstract

Concepts are generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, and fail to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries from partial information, encouraging them to explore the commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across categories and is not strictly aligned with parameter size or general model capability.

CK-Arena Demo: Undercover Game

Below is an interactive demonstration of the Undercover game used in CK-Arena. We first introduce the basic game rules, then show the agent interactions from the first round of a real experiment. In this game, LLM agents are assigned either the main concept ("bee") or an undercover concept ("butterfly"). Players take turns making statements about their concept without revealing it directly. The civilians aim to identify and vote out the undercover agents, while the undercover agents try to blend in without being detected.

Game Flow:
1. Role Assignment: Players are randomly assigned as civilians or undercover agents, each receiving a similar but distinct concept.
2. Concept Description: In each round, players take turns describing their concept while trying to hide their identity and infer others’.
3. LLM Evaluation: Statements are scored by LLM judges based on novelty, relevance, and reasonableness.
4. Threshold-Based Elimination: If a player’s score falls below a predefined threshold, they are automatically eliminated.
5. Voting Round: After a fixed number of rounds, all surviving players vote to eliminate one player based on the dialogue so far.
6. Win Condition Check: The game ends when one of the following holds:
   - All undercover agents are eliminated (civilians win);
   - The number of undercover agents equals the number of civilians (undercover wins);
   - The maximum number of rounds is reached.

A simplified Python sketch of this game loop follows.
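To make the flow concrete, here is a minimal, self-contained Python sketch of the loop described above. It is not the CK-Arena implementation: the function names (make_statement, judge_statement, cast_vote), the score threshold, the round counts, and the way the three judge scores are aggregated are illustrative assumptions, and in the actual benchmark the statements, judge scores, and votes are produced by LLM agents and LLM judges rather than the random placeholders used here.

import random

MAIN_CONCEPT, UNDERCOVER_CONCEPT = "bee", "butterfly"  # concept pair from the demo above
SCORE_THRESHOLD = 0.3        # hypothetical elimination threshold
STATEMENT_ROUNDS = 2         # assumed statement turns before each voting round
MAX_ROUNDS = 6               # assumed cap on game rounds


def make_statement(player, history):
    # Placeholder: in CK-Arena an LLM agent describes its concept without
    # naming it, conditioned on the dialogue history.
    return f"{player['name']} hints at something related to its concept."


def judge_statement(statement, history):
    # Placeholder: in CK-Arena, LLM judges score each statement for novelty,
    # relevance, and reasonableness; random values keep this sketch runnable.
    return {k: random.random() for k in ("novelty", "relevance", "reasonableness")}


def cast_vote(voter, candidates, history):
    # Placeholder: each surviving agent votes for the player it suspects.
    return random.choice([c for c in candidates if c is not voter])


def winner(players):
    undercover = [p for p in players if p["role"] == "undercover"]
    civilians = [p for p in players if p["role"] == "civilian"]
    if not undercover:
        return "civilians win"
    if len(undercover) >= len(civilians):
        return "undercover wins"
    return None


def play(num_players=5, num_undercover=1):
    # 1. Role assignment: similar but distinct concepts.
    roles = ["undercover"] * num_undercover + ["civilian"] * (num_players - num_undercover)
    random.shuffle(roles)
    players = [{"name": f"P{i}", "role": role,
                "concept": UNDERCOVER_CONCEPT if role == "undercover" else MAIN_CONCEPT}
               for i, role in enumerate(roles)]
    history = []

    for _ in range(MAX_ROUNDS):
        # 2-4. Statements, LLM judging, threshold-based elimination.
        for _ in range(STATEMENT_ROUNDS):
            for p in list(players):
                stmt = make_statement(p, history)
                scores = judge_statement(stmt, history)
                history.append((p["name"], stmt, scores))
                # Assumed aggregation: eliminate if any score drops below the threshold.
                if min(scores.values()) < SCORE_THRESHOLD:
                    players.remove(p)
                if winner(players):
                    return winner(players)
        # 5. Voting round: survivors vote out the most-suspected player.
        votes = [cast_vote(p, players, history) for p in players]
        players.remove(max(players, key=votes.count))
        # 6. Win-condition check.
        if winner(players):
            return winner(players)
    return winner(players) or "draw: maximum number of rounds reached"


if __name__ == "__main__":
    print(play())

Running play() walks through the same role assignment, description, judging, elimination, voting, and win-condition steps listed above, only with random stand-ins where the benchmark uses LLM calls.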

Experimental Results

We present selected experimental results below; for the full experiments and analysis, please refer to the paper.



Result Chart 1

Win rates of six LLMs across 11 concept categories. The comparison shows that each model has distinct strengths and weaknesses across categories, variations that are likely driven by differences in training data, architectural design, and model-specific optimization strategies. This analysis highlights each model's areas of strength, its knowledge gaps, and directions for improving conceptual reasoning.


Result Chart 2

Relevance scores of different LLMs across categories. In this heatmap, darker cells indicate higher scores, giving an at-a-glance view of how closely each model's descriptions relate to its assigned concepts in each category.
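For readers who want to produce this kind of figure from their own scores, here is a small matplotlib sketch. The model names, category labels, and values are made-up placeholders, not the results reported in the paper.

import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: rows are models, columns are concept categories.
models = ["Model-A", "Model-B", "Model-C"]
categories = ["Animals", "Tools", "Food", "Sports"]
scores = np.array([[0.72, 0.65, 0.80, 0.58],
                   [0.68, 0.74, 0.61, 0.70],
                   [0.59, 0.66, 0.77, 0.63]])   # illustrative relevance scores

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(scores, cmap="Blues", vmin=0, vmax=1)   # darker = higher score
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories)
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
for i in range(len(models)):                            # annotate each cell with its score
    for j in range(len(categories)):
        ax.text(j, i, f"{scores[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="relevance score")
plt.tight_layout()
plt.savefig("relevance_heatmap.png")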


Result Chart 3

The t-SNE visualization of all embedded statements in the Tools category for GPT-4o and Gemini-2.0-pro-exp. Gemini-2.0-pro-exp's statements are spread more widely in the embedding space, whereas GPT-4o's are more concentrated. This suggests that Gemini-2.0-pro-exp covers a broader range of conceptual knowledge, which indirectly reflects a deeper understanding of the concepts.
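Below is a minimal sketch of the embed-then-project pipeline behind this kind of visualization. The embedding model (all-MiniLM-L6-v2 via sentence-transformers), the toy statements, and the t-SNE settings are assumptions for illustration; the paper does not specify which embedding model or parameters were used, and the statements here are invented.

# Sketch of the embed-then-project pipeline, under the assumptions stated above.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

statements = {
    "model_a": ["It has a handle and a metal head.", "You swing it to drive nails."],
    "model_b": ["Often found in a toolbox.", "It can also pull nails out.",
                "Carpenters use it every day."],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
texts = [s for stmts in statements.values() for s in stmts]
labels = [m for m, stmts in statements.items() for _ in stmts]
embeddings = encoder.encode(texts)                        # shape: (n_statements, dim)

# Project to 2-D; perplexity must stay below the number of samples.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for model in statements:                                  # one color per model
    pts = coords[[i for i, l in enumerate(labels) if l == model]]
    plt.scatter(pts[:, 0], pts[:, 1], label=model)
plt.legend()
plt.title("t-SNE of embedded statements (illustrative)")
plt.savefig("tsne_statements.png")

A wider scatter for one model's points, as in the figure, indicates that its statements occupy a broader region of the embedding space.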


BibTeX

If you need to cite our work:

@article{xu2025probe,
  title={Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs},
  author={Shuhang Xu and Weijian Deng and Yixuan Zhou and Fangwei Zhong},
  journal={arXiv preprint arXiv:2505.17512},
  year={2025}
}
    

Acknowledgements

We sincerely acknowledge the foundational work of MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration, which first applied the Undercover game to the study of LLM-based agents. It systematically explored the game’s potential in agent evaluation, providing critical insights for our research.

@article{xu2024magicinvestigationlargelanguage,
  title={MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration},
  author={Lin Xu and Zhiyuan Hu and Daquan Zhou and Hongyu Ren and Zhen Dong and Kurt Keutzer and See Kiong Ng and Jiashi Feng},
  journal={arXiv preprint arXiv:2311.08562},
  year={2024}
}